BTLZA
BTLZA: BGBTech LZ77 + Arithmetic BTLZH: BGBTech LZ77 + Huffman Goals: * Deflate-like encoder but with (optional) Arithmetic Coding; * Ideally, sort of like LZMA but cheaper/faster; * Should ideally compress better than Deflate or Deflate + Arithmetic; * Should be mostly drop-in-compatible with Deflate. In cases where a Zlib header is expected by the format, a Faux Zlib Header will be used. Faux Zlib Header * 0xWMFF ** W=Log2 Window Size minus 8 ** M=Method *** 8=Deflate *** 9=Deflate64 *** A=BTLZA / BTLZH *** B=MTFRice * FF=Flags and Check ** 0bLLDCCCCC *** LL=Level, 0=Fastest, 3=Best Compression *** D=Preset Dictionary *** CC=Check Value The check value is a special value which is set such that the header (seen as a 16-bit big-endian value) is a multiple of 31. For methods 8 and 9, the compressed data is solely in Deflate or Deflate64 mode. For method A, the data is in BTLZA Mode. In these modes, the compressed data is to be followed by an Adler32 checksum of the expected output. If the preset dictionary flag is set, the header is followed by an Adler32 checksum which may be used to identify the dictionary. Compact Form Header * 0x8M (Primary) * 0xBM (Alternate) ** Only contains a method number, and is a single byte. ** Does not contain a checksum. ** Default Window Sizes: *** 32kB for Deflate *** 64kB for Deflate64 *** 1MB for BTLZA / BTLZH This header interferes with the coding of a 64kB Window: * Interferes with the normal Zlib header for Deflate64. May only be interpreted as a compact mode header if the Mod31 check fails. * Alternate may be used if the primary would pass the Mod31 check. * Alternate will likewise be subject to the Mod31 check. ** However, it will not occur that both would pass the Mod31 check. ** Alternate is only valid if primary would pass the Mod31 check. Coded Stream Coded Blocks Blocks will be tagged using a 1-bit final-block indicator followed by a 2 bit block-type ID. Bits will be read from bytes starting from the LSB. Block Types will use a 2-bit ID: * 0: Stored Data * 1: Fixed Table * 2: Dynamic Table * 3: Escape Case ** 3-bit secondary type follows. ** 0=Stored Data 2 ** 1=Arithmetic Mode Header ** 2=Dynamic Table 2 ** 3=Fixed Table 2 Block types 0, 1, and 2 will be the same as normal Deflate (as defined in RFC1951). If in Deflate64 Mode, the blocks will be nearly identical, but with a few slight differences: * The dictionary will be extended to 64kB. ** The last 2 entries in the distance table will be used for these lengths. * Symbol 285 will have 16 extra bits. ** It will encode a value from 3 to 65538. Stored Data Stored data is stored in an uncompressed form. * len:WORD, nLen:WORD, data:bytelen ** len: Length of stored data ** nLen: ~len ** data: raw byte data. Data is to be stored byte-aligned. The current behavior of this form in arithmetic mode is undecided. Stored Data 2 Stored data is stored without applying LZ or Huffman compression. These will be stored as a length followed by the data bytes (encoded as sequences of 8 bits). * Len:Dist(6) * Bytes(Len*8) In arithmetic mode, these bytes will be encoded using Model L. The length is encoded using a fixed 6-bit prefix and the distance table (Arithmetic model R). Fixed Huffman Table In this mode, the Huffman table will used fixed code-lengths. * 0-143: 8 bits * 144-255: 9 bits * 256-279: 7 bits * 280-287: 8 bits Fixed Huffman Table 2 Similar to the normal Fixed Huffman Table, except that a table-ID is given: * Hl(3): TableID, Literal (TableID_Lit) * Hd(2): TableID, Distance (TableID_Dist) TableID, Literal: * 0, Same literal table as before, expanded distance table ** 0-143: 8 bits ** 144-255: 9 bits ** 256-279: 7 bits ** 280-287: 8 bits *** Match lengths are limited to 386 bytes. * 1, Expanded A. ** 0-127: 8 bits ** 128-383: 9 bits *** 0xxxxxx: 0-127 8 bit *** 1xxxxxxx: 128-383 * 2, Expanded B. ** 0-255: 9 bits ** 256-383: 8 bits *** 0xxxxxx: 256-383 8 bit *** 1xxxxxxx: 0-255 * 3-6, Reserved * 7, Reuse tables from prior block TableID, Distance: * 0: ** 0-63: 6 bits (Window: Full Range) * 1: ** 0-31: 5 bits (Window: 64kB) * 2: ** 0-15: 4 bits (Window: 256 bytes) * 3: ** 0-7: 3 bits (Window: 16 bytes) TableID_Lit=7 * Reuse Literal and Distance tables from prior block. ** TableID_Dist=0 * For piecewise streams, this may reference tables from the prior stream segment. Dynamic Huffman Table * Hl(5) bits: Number of coded literal symbols (minus 257, 257-288). * Hd(5) bits: Number coded distance symbols (minus 1, 1-32). * Hc(4) bits: Number of header symbols (minus 4, 4-19). ** Hcl(Hc*3): Header symbol code lengths *** Encoded in the following order: **** 16, 17, 18, 0, 8,7, 9,6, 10,5, 11,4, 12,3, 13,2, 14,1, 15 This is followed by the literal and distance tables, which are encoded as such: * 0-15: Code Length * 16: Rc(2), Repeat prior code length 3-7 times. * 17: Zc(3), Code length of 0 for 3 to 10 times. * 18: Zc(7), Code length of 0 for 11 to 138 times. Dynamic Huffman Table 2 Similar to the normal Dynamic Huffman Table, except for a few sizes and similar: * Hr(3): Reserved, must be 0 * Hl(7): Number of coded literal symbols (minus 257, 257-384). * Hd(6): Number coded distance symbols (minus 1, 1-64). * Hc(5): Number of header symbols (minus 4, 4-35). ** Hcl(Hc*3): Header symbol code lengths Arithmetic Mode Header Hr(3): Header Mode * 0: Disable Arithmetic Coding ** Cancels out of arithmetic coding, returns to using raw bits. * 1: Enable Arithmetic Coding (Mode A) ** Hl(4)+12: Size of Context (Model R) ** Hw(5): Dictionary Window Size *** Enables the use of a arithmetic-coding mode. *** In this mode, Huffman coded symbols and extra bits are secondarily fed through a bitwise arithmetic coder. *** No distinction is made between the types of bits (All bits will use a single shared context model). *** Logically, bits are read/written in LSB-first ordering. * 2: Enable Arithmetic Coding (Mode B) ** Hl(5): Size of Literal Symbol Context (Model L) ** Hd(5): Size of Distance Symbol Context (Model D) ** Hx(5): Size of Extra-Bits Value Context (Model X) ** HCx(5): Number of context-encoded Extra-Bits *** After this count, fixed-probability bits will be used instead. *** If given as 31, all bits will use context modeling. ** Hy(5): Size of Raw-Bits Value Context (Model R) ** Hw(5): Dictionary Window Size *** Enables the use of a arithmetic-coding mode. *** In this mode, Huffman coded symbols and extra bits are secondarily fed through a bitwise arithmetic coder. *** Different types of bits will use different contexts. *** Logically, bits are read/written in LSB-first ordering. VLC Run Length A(*) (Prefix, Range, Extra): 257-264, 3- 10, 0 265-268, 11- 18, 1 269-272, 19- 34, 2 273-276, 35- 66, 3 277-280, 67-130, 4 281-284, 131-258, 5 285-288, 259-514, 6 - 289-292, 515- 1026, 7 293-296, 1027- 2050, 8 297-300, 2051- 4098, 9 301-304, 4099- 8194, 10 305-308, 8195-16386, 11 309-312, 16387-32770, 12 313-316, 32771-65537, 13 317, 65538 , 0 ... Run Length B(*) (Prefix, Range, Extra): 257-264, 3- 10, 0 265-268, 11- 18, 1 269-272, 19- 34, 2 273-276, 35- 66, 3 277-280, 67-130, 4 281-284, 131-258, 5 285, 258|3-65538, 0|16 The type of length table used will depend on the block type. * Fixed Huffman 2 and Dynamic Huffman 2 will use A. * Fixed Huffman and Dynamic Huffman will used B. ** The encoding of symbol 285 will depend on the mode: *** Deflate Mode: 258, 0 *** Deflate64 Mode: 3-65538, 16 *** BTLZA Mode: 3-65538, 16 ** Symbols beyond 285 are not used for these block types. Run Distance (Prefix, Range, Extra) 0- 3, 1- 4, 0 4/ 5, 5- 8, 1 6/ 7, 9- 16, 2 8/ 9, 17- 32, 3 10/11, 33- 64, 4 12/13, 65- 128, 5 14/15, 129- 256, 6 16/17, 257- 512, 7 18/19, 513- 1024, 8 20/21, 1025- 2048, 9 22/23, 2049- 4096, 10 24/25, 4097- 8192, 11 26/27, 8193-16384, 12 28/29, 16385-32768, 13 30/31, 32769-65536, 14 - 32/33, 65537-131072, 15 34/35, 131073-262144, 16 ... Extended Runs 318-321, 1- 4, 0 322/323, 5- 8, 1 324/325, 9- 16, 2 326/327, 17- 32, 3 328/329, 33- 64, 4 330/331, 65- 128, 5 332/333, 129- 256, 6 More Generalized Form of this scheme: VLCA (Prefix, Range, Extra): 0- 3, 0- 3, 0 4/ 5, 4- 7, 1 6/ 7, 8- 15, 2 8/ 9, 16- 31, 3 10/11, 32- 63, 4 12/13, 64- 127, 5 14/15, 128- 255, 6 16/17, 256- 511, 7 18/19, 512- 1023, 8 20/21, 1024- 2047, 9 22/23, 2048- 4095, 10 24/25, 4096- 8191, 11 26/27, 8192-16383, 12 28/29, 16384-32767, 13 30/31, 32768-65535, 14 - 32/33, 65536-131071, 15 34/35, 131072-262143, 16 ... VLCB (Prefix, Range, Extra): 0- 7, 0- 7, 0 8-11, 8- 15, 1 12-15, 16- 31, 2 16-19, 32- 63, 3 20-23, 64-127, 4 24-27, 128-255, 5 28-31, 256-511, 6 - 32-33, 512-1023, 7 34-35, 1024-2047, 8 ... VLC-VLN (Prefix, Range, Extra): 0/1, 0- 1, 0 (Zero Run) 2, 2- 3, 1 (Zero Run) 3, 4- 7, 2 (Zero Run) 4, 8- 15, 3 (Zero Run) 5, 16- 31, 4 (Zero Run) 6, 32- 63, 5 (Zero Run) 7, 64- 127, 6 (Value) 8, 128- 255, 7 (Value) 9, 256- 511, 8 10, 512- 1023, 9 11, 1024- 2047, 10 12, 2048- 4095, 11 13, 4096- 8191, 12 14, 8192-16383, 13 15, 16384-32767, 14 - 16, 32768- 65535, 15 17, 65536-131071, 16 18, 131072-262143, 17 ... Some values and arguments may have the sign folded into the LSB: * 0, -1, 1, -2, 2, -3, 3, -4, ... These cases will be called SVLCA and SVLCB. LZ Compression Symbol Ranges * 0-255: Literal Byte Values * 256: End Of Block * 257-285: LZ Runs (Deflate and BTLZH/BTLZA Modes) * 286-317: LZ Runs (BTLZH/BTLZA) * 318-333: Extended Runs ** Extended Run VLC encodes a command-tag. ** 1: Prior Length and Distance *** The run and distance from the last seen run are reused. ** 2: Prior Length, Distance Follows *** Reuse the length from the last seen run. *** A coded distance follows. ** 3: Prior Distance, Length Follows *** Reuse the distance from the prior run. *** A length follows as a literal length prefix and extra bits. * 334-365: (Possible) VLN Value (Possible/Undecided) VLN VLN are VLC-VLN values coded with a dedicated value range (as an optional feature). VLN values will be represented as-bytes with a format resembling UTF-8. * 0x80-0xBF: 0-63 (Single Byte) ** Coded as a literal byte. * 0xC0-0xDF: 64-2047 (2-Byte) * 0xE0-0xEF: 2048-65535 (3-Byte) * 0xF0-0xF7: 65536-2M (4-Byte) * 0xF8-0xFB: 2M-64M (5-Byte) * 0xFC-0xFD: 64M-2G (6-Byte) * 0xFE: 2G-64G (7-Byte) Values < 64 will be coded as a literal byte, rather than a VLN value. VLN values < 64 will instead code a run of zeroes (0x80 bytes). Arithmetic Coder To be filled in more once it stabilizes. Basically, it uses a bit-at-a-time encoder, with a variable context size. The context basically consists of the last N preceding bits, which is used as an index into a probabilities table. After encoding each bit, the context is shifted 1 bit and the encoded bit is added to the LSB (or MSB). This context is masked by the context-size, namely ((1<>8)*weight. * Values above this point (rval>dval) will correspond to a logical 1 bit ** min=bound+1; ** weight=weight-(w>>5); * Values less than or equal to this value will be interpreted as 0 ** max=bound; ** weight=weight+((256-w)>>5); Normalization happens when the high 8 bits of min and max become equal. For encoding: * Write the high 8 bits to the output; ** WriteByte(min>>24); * min=(min<<8); max=(max<<8)|255; For decoding: * min=(min<<8); max=(max<<8)|255; * val=(val<<8)|ReadByte();